Lack of self-averaging in neutral evolution of proteins 
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We simulate neutral evolution of proteins imposing conservation of the thermodynamic stabihty of 
the native state in the framework of an effective model of folding thermodynamics. This procedure 
generates evolutionary trajectories in sequence space which share two universal features for all of the 
examined proteins. First, the number of neutral mutations fluctuates broadly from one sequence to 
another, leading to a non-Poissonian substitution process. Second, the number of neutral mutations 
displays strong correlations along the trajectory, thus causing the breakdown of self- averaging of the 
resulting evolutionary substitution process. 

PACS numbers: 87.14.Ee, 87.15. Aa, 87.23. Kg, 87.15.Cc 



Introduction. - The neutral theory of molecular evolution 
is the simplest and most elegant theory of protein evolu- 
tion. It was formulated in the late 1960's [Q to explain 
the high substitution rate of amino acids observed in pro- 
teins of many vertebrates and the large intra-specific ge- 
netic variations between most species. The theory as- 
sumes that most of the amino acid substitutions occur- 
ring in an evolving population do not bring any selective 
advantage but are selectively neutral, maintaining the bi- 
ological activity of the protein at the original level. Such 
neutral mutations are assumed to occur at random at a 
rate ^x, where represents the genomic mutation rate 
and X is the fraction of mutations which happen to be 
neutral. The theory predicts that (i) the rate of substitu- 
tions (mutations which become fixed in the population) 
equals the neutral mutation rate ^x and depends only 
on the protein considered, independent of the size of the 
population and its ecology, and that (ii) the number of 
amino acid substitutions taking place in a time t follows 
a Poisson distribution with mean value pLXt, thus giving 
an explanation to the 'molecular clock' observed in the 
early 1960's [||. Later studies in the 1970's and 1980's 
have revealed, however, that the variance of the substi- 
tution process is larger than its mean value |^ , point- 
ing to an underlying non-Poissonian process. Since then, 
different alternatives have been suggested to explain this 
feature. Some authors have extended the neutral theory 
by including into it slightly deleterious mutations ||^ , oth- 
ers have rejected the neutral theory completely and have 
suggested that most mutations are fixed by positive se- 
lection j^. An interesting proposal within the realm of 
neutral theory is its modification in terms of a fluctuat- 
ing neutral space model 0, which can account for the 
non-Poissonian statistics. 

Progress in the understanding of the folding and ther- 
modynamics of biomolecules has opened the way to assess 
the thermodynamical stability of biomolecules involved 
in evolution through computational methods, thus pro- 
viding powerful tools to complement the traditional pop- 
ulation genetic approach. This structural approach has 
been introduced in the study of neutral networks (i.e.. 



the set of sequences connected by structure conserving 
mutations) of RNA secondary structures Q, and has 
found fruitful applications in the study of protein evo- 
lution Based on this structural approach, we have 
studied a model of neutral evolution denoted structurally 
constrained neutral model (SCN model), in which conser- 
vation of the thermodynamic stability of the native struc- 
ture is imposed . The SCN model yields evolutionary 
trajectories consisting of sequences connected through 
neutral mutations. In this Letter, we show that the frac- 
tion of neutral mutations obtained along the trajecto- 
ries fluctuates strongly, and consequently the SCN model 
produces a non-Poissonian substitution process, consis- 
tent with a fluctuating neutral network scenario and 
with the statistics of protein evolution Moreover, we 
find that the number of neutral neighbors displays strong 
auto-correlations along the trajectories, which deeply in- 
fluence the statistics of protein evolution as we discuss 
below. 

We have studied seven protein folds: myoglobin (Pro- 
tein Database (PDB) code la6g), cytochrome c (PDB 
code 451c), lysozyme (PDB code 31zt), ribonuclease 
(PDB code 7rsa), rubredoxin (both from a mesophilic 
and a thermophilic species, PDB codes liro and Ibrf _A), 
ubiquitin (PDB code lu9a_A), and the TIM barrel (PDB 
code 7tim_A) Although these proteins cover a broad 
spectrum of different biological activities, according to 
the SCN model their neutral evolution occurs in a rather 
similar way, suggesting that our model captures 'univer- 
sal' features of protein evolution. In what follows, we 
concentrate on myoglobin to illustrate this common sce- 
nario. 

The SCN model [|l0[ |l^. - The simulated trajectories are 
started from a given "wild-type" sequence A* of N amino 
acids (residues) that folds onto the native structure C*, 
where both A* and C* are taken from the PDB. The tar- 
get structure C* is kept fixed throughout the simulation. 
Following the neutral theory, a given mutated sequence 
A' is considered to be either neutral, if it folds onto C* 
preserving thermodynamic stability, or lethal otherwise. 
Neither advantageous nor slightly deleterious mutations 
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are considered. At each iteration, we generate all possi- 
ble sequences {A'} obtained through point mutations of 
the current sequence A [|l2j, and determine the fraction 
belonging to the neutral network. One of these neutral 
sequences is chosen at random and becomes the new cur- 
rent sequence. The whole process is iterated, resulting in 
an evolutionary trajectory {Aq = A*, Ai, A2, . . .}. 

To assess the conservation of the thermodynamic sta- 
bility of a test sequence in the target structure, we rely 
on two well-established empirical parameters, the nor- 
malized energy gap a measuring the minimal value 
of the energy gap between an alternative conformation 
and the target one divided by their structural distance, 
and the .Z-score [|l3[, measuring the difference between 
the native energy and the average energy of alternative 
compact conformations in units of the standard devia- 
tion of the energy. A positive and large value of the 
a-parameter ensures both that the target conformation 
has lowest energy and that the energy landscape is well 
correlated, the latter in the sense that conformations very 
different from the native have much higher energy. More- 
over, the native energy allows a rough estimate of the 
folding free energy. Alternative structures are generated 
by aligning in all possible ways the test sequence with 
all non-redundant protein structures in the PDB. Con- 
formational energies are computed through the effective 
energy function described in Ref. Q. The resulting es- 
timate of the thermodynamic stability provides a realis- 
tic genotype to phenotype mapping, although the energy 
function used is approximate and alternative structures 
are non-exhaustively sampled. 

Neutral connectivity. - In the SCN model, the quan- 
tity of interest is the fraction of neutral neighbors Xi = 
Xi/ X^i^-y^ e (0, 1] associated to each sequence A^, as this 
quantity determines the evolutionary substitution pro- 
cess (see below) . Here, Xi is the number of neutral neigh- 
bors of sequence A^, and X^ax is its maximal possible 
value fl^ . Thus, an evolutionary trajectory is an (ide- 
ally infinite) series x = {xf),xi,X2, ■ ■ ■}■ For each po- 
sition k, we define analogously evolutionary trajectories 



x*^*^-*, where x')^'^ counts the fraction of the 18 possible 
changes at position k which are neutral. The probabil- 
ity distribution P{x), where P{x)dx is the probability 
to find a fraction of neutral neighbors between x and 
X + dx in the trajectory x, obtained for myoglobin is 
shown in Fig. |l|(a) . The fraction of neutral neighbors is 
broadly distributed, reminiscent of a fluctuating neutral 
network 0. 

To get a deeper insight into the mechanism of neutral 
evolution, it is helpful to investigate the auto-correlation 
of the trajectory x. For this purpose, we study the 
'profile' p{n) of walks of n steps, defined as p{n) = 
Sr=i[^«~^]' where x — J^xP{x)dx is the average 
over all Xi in the trajectory (cf. inset in Fig. 0(b)). 
The 'roughness' of the profile tells us about the pres- 
ence or absence of correlations along the trajectory x. 
They can be determined quantitatively by calculating 
the fluctuations of p{n) within a window of width £ as 
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FIG. 1: (a) Probability distribution P{x) of the fraction x 
of neutral neighbors for myoglobin, (b) Analysis of the fluc- 
tuations of X along an evolutionary trajectory of myoglobin, 
displayed as F{£) vs £. We present the analysis of the whole 
protein (triangles) and the analysis for residues 25 (circles) 
and 115 (squares). The full lines have the slopes 0.82 (for the 
triangles, up to ^ « 20), 0.85 (for the circles, up to ~ 10^), 
and 0.9 (for the squares, up to £ « 10^) indicating strong 
correlations, whereas the dashed lines have the slope 1/2, in- 
dicating the absence of correlations. The inset in (b) shows 
part of the profile p{n) vs n for the total fraction of neutral 
neighbors (see text). 



F{£) = {[p{n)-p{n + £)]'^y^^ (see Ref. [[|). If x is a se- 
ries of independent variables, F{£) scales as i?^/^. Instead, 
if the series x is correlated (anti-correlated) over a length 
£q, the scaling is modified into F{£) ~ £'^ with ^ > 1/2 
{v < 1/2) on the same length scale. The information 
provided by F(£) is equivalent to the one given by the 
auto-correlation function, but the former has the advan- 
tage that different regimes and intermediate crossovers 
are more easily detected. 

The plot of the fluctuations of an evolutionary tra- 
jectory X obtained for myoglobin is shown in Fig. |l|(b), 
together with the analogous analysis of the fraction of 
neutral neighbors for residues 25 and 115, respectively 
the residues with the largest and the smallest mean frac- 
tion of neutral neighbors. The fraction of neutral neigh- 
bors for the whole protein is strongly auto-correlated, 
with an apparent exponent v « 0.82 for about 20 evo- 
lutionary steps, after which the random walk exponent 

— 1/2 is attained. Single residues have a somewhat 
larger apparent exponent between ly sa 0.85 (residue 25) 
and v Ri 0.9 (residue 115) for about 10^ to 10'^ evolution- 
ary steps, which is roughly the product of number of steps 
it takes the correlations to vanish for the whole protein 
times the protein length. These auto-correlations can be 
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qualitatively explained by the fact that more stable se- 
quences have a larger number of neutral neighbors (they 
are more tolerant to mutations), and stability itself is 
auto-correlated along an evolutionary trajectory as long 
as the number of mutations is small with respect to se- 
quence length. These correlations, although being short 
range, have an important influence on the evolutionary 
substitution process, as shown below. 
Modeling the substitution process. - In order to construct 
the substitution process, we have to obtain the number of 
mutations accepted within a time interval t. The substi- 
tution process consists of three steps: (i) generation of an 
evolutionary trajectory using the SCN model; (ii) deter- 
mination of the number k of mutation events taking place 
within the time i, which is assumed to be a Poissonian 
variable of average /it, so that the probability of k muta- 
tions within time t is Pt{k) — exp{—^t) (/ii)'^/fc!; (iii) de- 
termination of the number n of accepted mutation events 
out of k, where the corresponding conditional proba- 

bility Pacc(r^ I k) = [nr=i^dE{™,}n;=i(i-^.)™^- 

Here, the {rrij} are all integer numbers between zero and 
k — n satisfying X)j=i ™i = k — n. In other words, the 
probability that a mutation is accepted is xq — a;(Ao) 
as long as the protein sequence is Aq, xi = a;(Ai) as 
long as the sequence is Ai, and so on. If all Xi = x 
are equal, Pa.cc{n \ k) reduces to a binomial distribu- 
tion Pace I k) = Q x" {I - x)''-" . The probabil- 
ity Ilt{n) that n mutations are accepted within time 
t is the weighted sum over k of the acceptance prob- 
ability, Ilt{n) = J2k>nPtik) Ps^ccin I k), which in the 
case of equal Xi reduces to a simple Poisson distribution 
n((n) = exp{—^tx) (fitx)"" /n\ with average value ^xt and 
substitution rate fix^ as in the original neutral model. 
Statistics of the substitution process. - In our analysis, 
two kinds of averages must be distinguished. We indicate 
by angular brackets ( • ) the average over the mutation 
and acceptance process for a given realization of the evo- 
lutionary trajectory, and by an overline • the average 
over evolutionary trajectories. We determine the mean 
number of substitutions {St) within a time t, and show 
this quantity in Fig. ^ together with the normalized mu- 
tation variance i?^ = ((<5'^) — (S'()"')/(S't), the normal- 
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ized trajectory variance — {{St) — (St) )/(St), and 

the normalized total variance R = R^ + R^. The latter 
quantity is also known as dispersion index. Notice that if 
all Xi — X are the same, one gets i?^ = 1 as for all Pois- 
sonian processes, and the normalized trajectory variance 
Rx — 0, since x is constant in all sequence space. 

The results of the substitution process based on the 
evolutionary trajectories are shown in Fig. ^(a). We also 
show in Fig. ^(b) results based on annealed trajectories, 
obtained by extracting at random the values of Xi at each 
substitution event according to the observed distribution 
P{x). In this case, the different Xi along the trajectories 
are independent variables. Note that the annealed trajec- 
tories 'interpolate' between the Poissonian case (all 
equal) and the correlated trajectories obtained through 



o< 



Co 



(a) 



(b) 



OOOCOOCCOOC>DDC>CXXXXXXXXmOODDOC^^ 



60 120 180 

FIG. 2: Statistics of the substitution process for myoglobin, 
shown are the average number of substitutions (St) divided by 
fit (circles), the normalized mutation variance (squares), 
the normalized trajectory variance Rx (triangles), and the 
normalized total variance R = J?^ -I- Rx (diamonds). We 
present in (a) the statistics for the trajectories obtained 
through the SCN model (full symbols), and in (b) the same 
for the corresponding annealed trajectories (open symbols). 



the SCN model: (i) The comparison between the an- 
nealed trajectories and the simple Poissonian case allows 
us to identify the effect of the broad distribution of the 
fraction of neutral neighbors, whereas (ii) the comparison 
between the actual and the annealed trajectories allows 
us to identify the effect of correlations. 

In the annealed case, the time spans r between subse- 
quent substitutions are independent variables distributed 
with the density D(t) = P{x) (ij,x)~^ exp{—fixT) dx, 
whose average value is r = P{x) (l^x)^^ dx. Thus, the 

average substitution rate (^St)/t is not constant in time 
as in the Poissonian case. Initially, St is a Poissonian 
variable with average rate fix. At large time, however, 
the rate converges to the smaller value {St)/t « 1/t, 
since the process spends more and more time in sequences 
with small x. Hence, the standard deviation of r is 
slightly larger than its average value, and the normal- 
ized mutation variance i?^ is larger than the Poissonian 
value R^ = 1, although the actual difference is small 
(see Fig. ||(b)). The normalized trajectory variance R^ 
is very small, of the order of the ratio between variance 
and average value of the Xi . 

Using the actual evolutionary trajectories (see 
Fig. 11(a)), we note that the presence of correlations has 
only a weak effect both on the average number of substi- 
tutions {St) and on the normalized mutation variance 
i?^. However, the normalized trajectory variance in- 
creases considerably in response to the correlations, as 
Rx ^ ^ for fit — 200. It even grows with time, al- 
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though more and more sequences are used to compute 
the mutational averages and one could naively expect 
that such averages approach typical values. Hence, the 
large fluctuations between different trajectories caused 
by the strong auto-correlations result in the breakdown 
of self-averaging in the substitution process, in the sense 
that even averaging over an arbitrary long trajectory 
does not give values representative of typical trajectories. 
(We note, however, that the variable {St)/t is still self- 
averaging as its variance vanishes in the long t limit, but 
the normalized variable l^St) /^/t has a variance which in- 
creases with time.) The normalized total variance R be- 
comes larger than 2 already for fjd « 100, in agreement 
with empirical estimates, varying in the range 1 < i? < 5 
for most of the proteins studied jH ^ . Hence, these large 
dispersion indices may be to a large extent due to the 
correlations present in the evolutionary process. 
Conclusions. - We have shown that the evolutionary tra- 
jectories in sequence space generated by the SCN model 
are characterized both by a broad distribution of neutral 
connectivities and by strong correlations, which deeply 
influence the statistics of neutral evolution and the sub- 
stitution process. For example, fluctuations of the evolu- 
tionary rate from one branch of the evolutionary tree to 
the other can obscure lineage effects, i.e. variations of the 
substitution rate among different taxonomic groups. One 
such effect is the generation time effect: Since the natural 
time unit for measuring substitution events is the genera- 
tion time (at which reproduction takes place), the longer 
the latter the slower the substitution rate is expected 
to be. This has been verified by comparing for instance 
substitution rates in rodents and primates | p6| . However, 
the effect is significantly larger for synonymous substitu- 
tions (DNA changes which still code for the same amino 
acid) than for non-synonymous ones. Non-synonymous 
substitutions are superimposed with the large and corre- 
lated fluctuations in the substitution rate that we just 
described, while synonymous ones are not. Thus the 
statistics of neutral substitutions could explain this quan- 
titative difference. Additionally, a better understanding 
of the mechanisms of neutral evolution will help to sin- 
gle out the more interesting cases of positive selection 
as, for instance, changes in the protein function and re- 
sponses to variations of the environment. The best cur- 
rent bioinformatics method to identify such cases of posi- 
tive selection , recently used to study the evolution of 
Drosophila genes |p^ , assumes a neutral substitution pro- 
cess with constant fraction of neutral neighbors x. The 
broad distribution and the correlations that we observe 
can mimick the presence of positive selection, and they 
should be taken into account to improve the performance 
of such methods. 
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